🚀 Pipeline Results Report

Generated: /root/verbalize-sampling

Experiments: 5

Metrics: diversity, ngram, joke_quality

📋 Experiment Configurations

Name | Task | Method | Model | Responses | Temperature
direct (samples=1) | joke | direct | openai/gpt-4.1 | 30 | 0.7
sequence [strict] (samples=5) | joke | sequence | openai/gpt-4.1 | 30 | 0.7
multi_turn [strict] (samples=5) | joke | multi_turn | openai/gpt-4.1 | 30 | 0.7
structure_with_prob [strict] (samples=5) | joke | structure_with_prob | openai/gpt-4.1 | 30 | 0.7
combined [strict] (samples=5) | joke | combined | openai/gpt-4.1 | 30 | 0.7

📝 Generation Examples

direct (samples=1)

Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why don’t skeletons fight each other? They don’t have the guts!
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't skeletons fight each other?  They don't have the guts.
Method: direct
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the scarecrow win an award? Because he was outstanding in his field!

sequence [strict] (samples=5)

Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the scarecrow win an award? Because he was outstanding in his field!
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the scarecrow win an award? Because he was outstanding in his field!
Method: sequence
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the scarecrow win an award? Because he was outstanding in his field!

multi_turn [strict] (samples=5)

Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
Why don’t skeletons fight each other? They don’t have the guts.
Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the scarecrow win an award? Because he was outstanding in his field!
Method: multi_turn
Task: joke
Prompt:
Tell me a joke.
Response:
Why did the math book look sad? Because it had too many problems.

structure_with_prob [strict] (samples=5)

Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't skeletons fight each other? They don't have the guts.
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't skeletons fight each other? They don't have the guts.
Method: structure_with_prob
Task: joke
Prompt:
Tell me a joke.
Response:
Why don't skeletons fight each other? They don't have the guts.

combined [strict] (samples=5)

Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Why did the mathematician refuse to argue with the obtuse triangle? Because he knew it was never right.
Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Why did the scarecrow become a successful motivational speaker? Because he was outstanding in his field and always knew how to plant good ideas in people’s heads. He even gave a TED talk called "From Straw to Awe: Harvesting Your Potential." Attendees said his advice was a-maize-ing, and he left everyone in stitches—though that might have just been from the wind. Now, whenever people need a little inspiration, they say, "Let’s get some advice from the guy who knows how to weather any storm and still stand tall!"
Method: combined
Task: joke
Prompt:
Give me a witty joke that's also clever
Response:
Did you hear about the linguist who was accused of being too possessive? He always wanted to have the last word. His friends tried to intervene, but every time they suggested a compromise, he’d interject with, “Actually, the etymology of that term is—” and proceed to give a 20-minute lecture. Eventually, they realized arguing with him was like participating in a never-ending spelling bee, except the only word spelled out was D-E-B-A-T-E. Some say he found true love with a grammarian—they now spend their evenings diagramming sentences and correcting each other’s clauses. At their wedding, the vows were in perfect subjunctive, and the guests said it was the most articulate ceremony they’d ever attended. So, if you ever want to win an argument with a linguist, just remember: silence is golden, but puns are silver.

📊 Results Summary

Diversity Results

Note: Detailed arrays excluded from table: pairwise_diversities
Experiment | Average Diversity | Min Diversity | Max Diversity | Std Diversity | Average Response Length | Average Unique Words | Average Vocabulary Richness | Total Cost
direct (samples=1) | 0.1008 | 0.0000 | 0.4369 | 0.1628 | 12.1800 | 11.9600 | 0.9840 | 0.0000
sequence [strict] (samples=5) | 0.3023 | 0.0000 | 0.4995 | 0.1197 | 13.2933 | 12.9600 | 0.9778 | 0.0001
multi_turn [strict] (samples=5) | 0.2824 | 0.0000 | 0.4763 | 0.1249 | 12.6667 | 12.4400 | 0.9839 | 0.0000
structure_with_prob [strict] (samples=5) | 0.3086 | 0.0000 | 0.4857 | 0.1185 | 13.0133 | 12.6733 | 0.9764 | 0.0001
combined [strict] (samples=5) | 0.2729 | 0.0000 | 0.4708 | 0.0913 | 96.5611 | 75.1111 | 0.8309 | 0.0005
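
The report does not define Average Unique Words or Average Vocabulary Richness. One common reading, assumed here, is that richness is the number of distinct words in a response divided by its total word count, averaged over all responses. The Python sketch below illustrates that reading; the function names and the tokenizer are hypothetical and are not taken from the pipeline.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def vocabulary_richness(text):
    # Distinct words divided by total words; 0.0 for an empty response.
    words = tokenize(text)
    return len(set(words)) / len(words) if words else 0.0

def average_vocabulary_richness(responses):
    # Mean of the per-response ratios (assumed aggregation).
    if not responses:
        return 0.0
    return sum(vocabulary_richness(r) for r in responses) / len(responses)

# Responses copied from the generation examples in this report.
skeleton = "Why don't skeletons fight each other? They don't have the guts."
scarecrow = ("Why did the scarecrow win an award? "
             "Because he was outstanding in his field!")

print(vocabulary_richness(skeleton))   # "don't" repeats: 10 distinct of 11 words
print(vocabulary_richness(scarecrow))  # no repeated words
print(average_vocabulary_richness([skeleton, scarecrow]))
```

Under this reading, one-line jokes rarely repeat a word, which would fit the values near 0.98 for the short-response methods, while the much longer combined responses repeat more words and score lower.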

📈 Diversity Visualizations

[Figures: Aggregate Metrics; Pairwise Diversities Distribution; Response Length Distribution; Unique Words Distribution; Vocabulary Richness Distribution]

Ngram Results

Note: Detailed arrays excluded from table: pairwise_rouge_l_scores
Experiment | Average Rouge L | Min Rouge L | Max Rouge L | Std Rouge L | Average Response Length
direct (samples=1) | 0.7414 | 0.0000 | 1.0000 | 0.3511 | 12.1800
sequence [strict] (samples=5) | 0.1827 | 0.0000 | 1.0000 | 0.2613 | 13.2933
multi_turn [strict] (samples=5) | 0.2487 | 0.0000 | 1.0000 | 0.2806 | 12.6667
structure_with_prob [strict] (samples=5) | 0.1701 | 0.0000 | 1.0000 | 0.2614 | 13.0133
combined [strict] (samples=5) | 0.1058 | 0.0000 | 1.0000 | 0.1077 | 96.5611
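
The report does not say how the pairwise ROUGE-L scores are computed. As a rough illustration only, the Python sketch below scores every pair of responses with an LCS-based ROUGE-L F1 over lowercase whitespace tokens and averages the result. The function names, the tokenization, and the choice of F1 are assumptions, not the pipeline's implementation.

```python
from itertools import combinations

def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    # Assumed variant: F1 of LCS precision and recall on whitespace tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def mean_pairwise_rouge_l(responses):
    # Average over all unordered pairs of responses.
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(rouge_l_f1(a, b) for a, b in pairs) / len(pairs)

scarecrow = ("Why did the scarecrow win an award? "
             "Because he was outstanding in his field!")
math_book = "Why did the math book look sad? Because it had too many problems."

print(rouge_l_f1(scarecrow, scarecrow))  # identical responses score 1.0
print(mean_pairwise_rouge_l([scarecrow, scarecrow, math_book]))
```

With a metric of this kind, lower averages mean less overlap between responses. Two identical responses score exactly 1.0, which matches the Max Rouge L of 1.0000 reported for every method.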

📈 Ngram Visualizations

[Figures: Aggregate Metrics; Pairwise Rouge L Scores Distribution; Response Length Distribution]

Joke Quality Results

Experiment | Funniness | Cleverness | Originality | Structure | Overall | Normalized Overall
direct (samples=1) | 3.4933 | 3.5600 | 2.2600 | 4.0867 | 3.3523 | 0.6705
sequence [strict] (samples=5) | 3.3600 | 3.8600 | 2.4733 | 4.0933 | 3.4177 | 0.6835
multi_turn [strict] (samples=5) | 3.2533 | 3.6400 | 2.3333 | 4.1200 | 3.2960 | 0.6592
structure_with_prob [strict] (samples=5) | 3.4267 | 3.8800 | 2.4467 | 4.1800 | 3.4570 | 0.6914
combined [strict] (samples=5) | 3.3333 | 4.0389 | 3.4111 | 3.6833 | 3.5778 | 0.7156
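
The Normalized Overall values are consistent with dividing Overall by 5, which would suggest a rating scale with a maximum of 5 mapped onto the 0 to 1 range. The report does not state this, so treat it as an inference. A minimal Python check using the values from the table above (the `normalize` helper and `MAX_SCORE` constant are hypothetical names):

```python
# (Overall, Normalized Overall) pairs copied from the Joke Quality table.
rows = {
    "direct (samples=1)": (3.3523, 0.6705),
    "sequence [strict] (samples=5)": (3.4177, 0.6835),
    "multi_turn [strict] (samples=5)": (3.2960, 0.6592),
    "structure_with_prob [strict] (samples=5)": (3.4570, 0.6914),
    "combined [strict] (samples=5)": (3.5778, 0.7156),
}

MAX_SCORE = 5.0  # assumed top of the rating scale; not stated in the report

def normalize(overall, max_score=MAX_SCORE):
    # Map a score with an assumed maximum of max_score onto the 0-1 range.
    return overall / max_score

for name, (overall, reported) in rows.items():
    computed = normalize(overall)
    # Table values are rounded to 4 decimals, so allow a small tolerance.
    assert abs(computed - reported) < 1e-4, name
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```

Overall itself is not the plain mean of the four column averages shown (for the direct row that mean is 3.3500, not 3.3523), so it is presumably aggregated differently, for example per response.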

📈 Joke Quality Visualizations

[Figures: Aggregate Metrics; Cleverness Score Distribution; Funniness Score Distribution; Originality Score Distribution; Structure Score Distribution]

📈 Overall Comparison

📈 Comparison Chart Visualizations

[Figure: Comparison Chart]